The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
from scipy.stats import zscore
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# mount Google Drive to upload data from the drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# download, read & assign data to a variable
pd.read_csv('/content/drive/MyDrive/DataSets/stock_data.csv')
stock_data = pd.read_csv('/content/drive/MyDrive/DataSets/stock_data.csv')
Checking the shape of the dataset
#checking the shape of the data
stock_data.shape
(340, 15)
#checking the shape of the data
stock_data.shape
print(f"There are {stock_data.shape[0]} rows and {stock_data.shape[1]} columns.")
There are 340 rows and 15 columns.
Displaying few rows of the dataset
# view the first 5 rows of the train datasets
stock_data.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# view a sample of the data
stock_data.sample(n=10, random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 | |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
Checking the data types of the columns for the dataset
# checking the column names and datatypes
stock_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
Creating a copy of original data
# copying the data to another variable to avoid any changes to original data
stock = stock_data.copy()
Checking for duplicates and missing values
# check for duplicate values
stock.duplicated().sum()
0
# check for missing values in the data
stock.isnull().sum()
Ticker Symbol 0 Security 0 GICS Sector 0 GICS Sub Industry 0 Current Price 0 Price Change 0 Volatility 0 ROE 0 Cash Ratio 0 Net Cash Flow 0 Net Income 0 Earnings Per Share 0 Estimated Shares Outstanding 0 P/E Ratio 0 P/B Ratio 0 dtype: int64
Statistical summary of the dataset
stock.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net Cash Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327936 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings Per Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated Shares Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
Observation
The datasates has 15 features, 4 are of categorical datatypes and the remaining 11 are numerical datatypes.
It also have 340 observations for each feature.
There are no missing nor duplicated information in the dataset.
From the Statistical Summary, most of the distibution are right skewed.
Distribution of Stock Price
sns.boxplot(data = stock, x = 'Current Price')
plt.title('Boxplot of Current Price')
plt.xlabel('Current Price')
plt.show();
Distribution of Percentage change in the stock price
sns.boxplot(data = stock, x = 'Price Change')
plt.title('Boxplot of Price Change')
plt.xlabel('Price Change')
plt.show();
Distribution of Volatility
sns.boxplot(data = stock, x = 'Volatility')
plt.title('Boxplot of Volatility')
plt.xlabel('Volatility')
plt.show();
Distribution of ROE
sns.boxplot(data = stock, x = 'ROE')
plt.title('Boxplot of ROE')
plt.xlabel('ROE')
plt.show();
Distribution of Cash Ratio
sns.boxplot(data = stock, x = 'Cash Ratio')
plt.title('Boxplot of Cash Ratio')
plt.xlabel('Cash Ratio')
plt.show();
Distribution of Net Cash Flow
sns.boxplot(data = stock, x = 'Net Cash Flow')
plt.title('Boxplot of Net Cash Flow')
plt.xlabel('Net Cash Flow')
plt.show();
Distribution of Net Income
sns.boxplot(data = stock, x = 'Net Income')
plt.title('Boxplot of Net Income')
plt.xlabel('Net Income')
plt.show();
Distribution of Earnings Per Share
sns.boxplot(data = stock, x = 'Earnings Per Share')
plt.title('Boxplot of Earnings Per Share')
plt.xlabel('Earnings Per Share')
plt.show();
Distribution of Estimated Shares Outstanding
sns.boxplot(data = stock, x = 'Estimated Shares Outstanding')
plt.title('Boxplot of Estimated Shares Outstanding')
plt.xlabel('Estimated Shares Outstanding')
plt.show();
Distribution of P/E Ratio
sns.boxplot(data = stock, x = 'P/E Ratio')
plt.title('Boxplot of P/E Ratio')
plt.xlabel('P/E Ratio')
plt.show();
Distribution of P/B Ratio
sns.boxplot(data = stock, x = 'P/B Ratio')
plt.title('Boxplot of P/B Ratio')
plt.xlabel('P/B Ratio')
plt.show();
Distribution of GICS Sector
sns.countplot(data = stock, x = 'GICS Sector')
plt.title('Distribution of GICS Sector')
plt.xlabel('GICS Sector')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
Distribution of GICS Sub Industry
plt.figure(figsize=(15, 7))
sns.countplot(data = stock, x = 'GICS Sub Industry')
plt.title('Distribution of GICS Sub Industryr')
plt.xlabel('GICS Sub Industry')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
num_col = ['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio', 'Net Cash Flow',
'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding','P/E Ratio', 'P/B Ratio']
# correlation check
plt.figure(figsize=(15, 7))
sns.heatmap(
stock.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap = 'Spectral'
)
plt.show()
scaled_stock = stock.iloc[:,4:].apply(zscore)
sns.pairplot(data = scaled_stock, diag_kind="kde")
plt.show()
Analyzing price increase Distribution Among Different Economic Sectors
plt.figure(figsize=(15,8))
sns.barplot(data = stock, x='Price Change', y='GICS Sector', ci=False)
plt.xticks(rotation=90)
plt.show()
Analyzing Cash ratios Distribution Among Different Economic Sectors
plt.figure(figsize=(15,8))
sns.barplot(data = stock, x='Cash Ratio', y='GICS Sector', ci=False)
plt.xticks(rotation=90)
plt.show()
Analyzing P/E ratios Distribution Among Different Economic Sectors
plt.figure(figsize=(15,8))
sns.barplot(data = stock, x='P/E Ratio', y='GICS Sector', ci=False)
plt.xticks(rotation=90)
plt.show()
Analyzing Volatility Distribution Among Different Economic Sectors
plt.figure(figsize=(15,8))
sns.barplot(data = stock, x='Volatility', y='GICS Sector', ci=False)
plt.xticks(rotation=90)
plt.show()
Observations
There isn't high correlation between the numerical features.
Energy, Healthcare, Telecommunications Services and Informatio Technology are industries doing well with Cash Ratio, Volatility, price changes and P/E Ratios.
plot the boxplots for the numerical features to check for outliers.
plt.figure(figsize=(15, 12))
numeric_columns = stock.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(numeric_columns):
plt.subplot(3, 4, i + 1)
plt.boxplot(stock[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
scale the data .
# scaling the data before clustering
scaler = StandardScaler()
#subset = stock[num_col].copy()
subset = stock.iloc[:,4:]
subset_scaled = scaler.fit_transform(subset)
# creating a dataframe of the scaled data
subset_scaled_stock = pd.DataFrame(subset_scaled, columns = subset.columns)
Observations
There are lots of outliers present in the dataset.
k_means_stock = subset_scaled_stock.copy()
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(subset_scaled_stock)
prediction = model.predict(k_means_stock)
distortion = (
sum(np.min(cdist(k_means_stock, model.cluster_centers_, 'euclidean'), axis=1))
/ k_means_stock.shape[0]
)
meanDistortions.append(distortion)
print('Number of Clusters:', k, '\tAverage Distortion:', distortion)
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average Distortion')
plt.title('Selecting k with the Elbow Method', fontsize=20)
plt.show()
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.2692367155390745 Number of Clusters: 4 Average Distortion: 2.1745559827866363 Number of Clusters: 5 Average Distortion: 2.128799332840716 Number of Clusters: 6 Average Distortion: 2.080400099226289 Number of Clusters: 7 Average Distortion: 2.0289794220177395 Number of Clusters: 8 Average Distortion: 1.964144163389972 Number of Clusters: 9 Average Distortion: 1.9221492045198068 Number of Clusters: 10 Average Distortion: 1.8513913649973124 Number of Clusters: 11 Average Distortion: 1.8024134734578485 Number of Clusters: 12 Average Distortion: 1.7900931879652673 Number of Clusters: 13 Average Distortion: 1.7417609203336912 Number of Clusters: 14 Average Distortion: 1.673559857259703
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
# fit the data to the visualizer
visualizer.fit(k_means_stock)
# finalize and render figure
visualizer.show()
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters = n_clusters, random_state=1)
preds = clusterer.fit_predict((subset_scaled_stock))
score = silhouette_score(k_means_stock, preds)
sil_score.append(score)
print('For n_clusters = {}, the silhouette score is {})'.format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.43969639509980457) For n_clusters = 3, the silhouette score is 0.4644405674779404) For n_clusters = 4, the silhouette score is 0.4577225970476733) For n_clusters = 5, the silhouette score is 0.43228336443659804) For n_clusters = 6, the silhouette score is 0.4005422737213617) For n_clusters = 7, the silhouette score is 0.3976335364987305) For n_clusters = 8, the silhouette score is 0.40278401969450467) For n_clusters = 9, the silhouette score is 0.3778585981433699) For n_clusters = 10, the silhouette score is 0.13458938329968687) For n_clusters = 11, the silhouette score is 0.1421832155528444) For n_clusters = 12, the silhouette score is 0.2044669621527429) For n_clusters = 13, the silhouette score is 0.23424874810104204) For n_clusters = 14, the silhouette score is 0.12102526472829901)
13 clusters indicates a high silhouette coefficient, with a values closest to one(1) and at the time having a low variability.
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric = 'silhouette', timings=True)
# fit the data to the visualizer
visualizer.fit(k_means_stock)
# finalize and render figure
visualizer.show()
<Axes: title={'center': 'Silhouette Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='silhouette score'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(13, random_state=1))
## Complete the code to visualize the silhouette scores for certain number of clusters
visualizer.fit(k_means_stock)
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 13 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# final K-means model
kmeans = KMeans(n_clusters = 13 , random_state=1)
## Complete the code to choose the number of clusters
kmeans.fit(k_means_stock)
KMeans(n_clusters=13, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=13, random_state=1)
# creating a copy of the original data
stock1 = stock.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_stock['KM_segments'] = kmeans.labels_
stock1['KM_segments'] = kmeans.labels_
km_cluster_profile = stock1.groupby('KM_segments').mean()
km_cluster_profile['count_in_each_segment'] = (
stock1.groupby('KM_segments')['Security'].count().values
)
km_cluster_profile.style.highlight_max(color = 'yellow', axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM_segments | ||||||||||||
| 0 | 87.750714 | 6.493815 | 1.263817 | 46.523810 | 68.142857 | 535321428.571429 | 5464911904.761905 | 6.613333 | 1109268300.866667 | 17.144309 | -2.236844 | 42 |
| 1 | 62.440873 | 16.496145 | 1.842172 | 28.727273 | 152.424242 | 15584454.545455 | 906482696.969697 | 1.901818 | 693964716.224242 | 38.656467 | 5.729196 | 33 |
| 2 | 71.338384 | 3.754053 | 1.351596 | 21.447761 | 45.179104 | -17599323.383085 | 903304940.298507 | 3.341468 | 299324257.335970 | 23.887399 | -2.398044 | 201 |
| 3 | 429.569995 | 3.761200 | 1.565483 | 18.000000 | 174.400000 | 362405800.000000 | 496954600.000000 | 9.900000 | 59200701.450000 | 55.569866 | 46.627828 | 5 |
| 4 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 5 | 50.744999 | 5.512605 | 1.070356 | 24.875000 | 60.250000 | -3860125000.000000 | 15782875000.000000 | 3.632500 | 4640082627.625000 | 16.598649 | -7.787061 | 8 |
| 6 | 65.364667 | 10.707298 | 1.564626 | 12.000000 | 581.200000 | 336879000.000000 | 25940800.000000 | 1.158000 | 1193582906.920000 | 34.483791 | 9.400072 | 5 |
| 7 | 101.838570 | 0.089939 | 1.772285 | 15.142857 | 46.428571 | 52442714.285714 | 1134066428.571429 | 4.841429 | 155080956.911429 | 19.023144 | -51.728525 | 7 |
| 8 | 35.263847 | -16.175693 | 2.841300 | 49.769231 | 48.153846 | -135215038.461538 | -2525946153.846154 | -6.514231 | 482428533.751538 | 77.817252 | 1.618150 | 26 |
| 9 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 10 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 11 | 108.304002 | 10.737770 | 1.165694 | 566.200000 | 26.600000 | -278760000.000000 | 687180000.000000 | 1.548000 | 349607057.720000 | 34.898915 | -16.851358 | 5 |
| 12 | 327.006671 | 21.917380 | 2.029752 | 4.000000 | 106.000000 | 698240666.666667 | 287547000.000000 | 0.750000 | 366763235.300000 | 400.989188 | -5.322376 | 3 |
for cl in stock1['KM_segments'].unique():
print("In cluster {}, the following companies are present:".format(cl))
print(stock1[stock1['KM_segments'] == cl]['Security'].unique())
print()
In cluster 0, the following companies are present: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'American International Group, Inc.' 'American Express Co' 'Boeing Company' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Chubb Limited' 'Capital One Financial' 'CVS Health' 'Chevron Corp.' 'Delta Air Lines' 'The Walt Disney Company' 'Ford Motor' 'Gilead Sciences' 'General Motors' "Honeywell Int'l Inc." 'Host Hotels & Resorts' 'International Business Machines' 'Lockheed Martin Corp.' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' "McDonald's Corp." 'Mondelez International' 'MetLife Inc.' '3M Company' 'Altria Group Inc' 'Merck & Co.' 'PepsiCo Inc.' 'Philip Morris International' 'PNC Financial Services' 'Prudential Financial' 'The Travelers Companies Inc.' 'Texas Instruments' 'United Continental Holdings' 'United Health Group Inc.' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Valero Energy'] In cluster 1, the following companies are present: ['Adobe Systems Inc' 'Analog Devices, Inc.' 'Applied Materials Inc' 'Amgen Inc' 'Arconic Inc' 'Activision Blizzard' 'Broadcom' 'Baxter International Inc.' 'Bristol-Myers Squibb' 'Celgene Corp.' 'Centene Corporation' 'Du Pont (E.I.)' 'eBay Inc.' 'Edwards Lifesciences' 'Fluor Corp.' 'First Solar Inc' 'Corning Inc.' 'Halliburton Co.' 'Juniper Networks' 'Mattel Inc.' "Moody's Corp" 'Mead Johnson' 'Mylan N.V.' 'Newmont Mining Corp. (Hldg. Co.)' 'Nucor Corp.' 'PayPal' 'Charles Schwab Corporation' 'Skyworks Solutions' 'TripAdvisor' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Wynn Resorts Ltd' 'Zoetis'] In cluster 2, the following companies are present: ['Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'AMETEK Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Aon plc' 'Amphenol Corp' 'AvalonBay Communities, Inc.' 'American Water Works Company Inc' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'Ball Corp' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'Cummins Inc.' 'CMS Energy' 'CenterPoint Energy' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'Dominion Resources' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'The Hershey Company' 'Humana Inc.' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mohawk Industries' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' 'Marathon Petroleum' 'M&T Bank Corp.' 'Mettler Toledo' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Phillips 66' 'Praxair Inc.' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Under Armour' 'UDR Inc' 'Universal Health Services, Inc.' 'Unum Group' 'Varian Medical Systems' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Ventas Inc' 'Wec Energy Group Inc' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp'] In cluster 3, the following companies are present: ['Alliance Data Systems' 'Chipotle Mexican Grill' 'Equinix' 'Intuitive Surgical Inc.' 'Regeneron'] In cluster 11, the following companies are present: ['Allegion' 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark' 'S&P Global, Inc.'] In cluster 12, the following companies are present: ['Alexion Pharmaceuticals' 'Amazon.com Inc' 'Netflix Inc.'] In cluster 7, the following companies are present: ['Affiliated Managers Group Inc' 'Anthem Inc.' 'CME Group Inc.' 'Discovery Communications-A' 'Discovery Communications-C' 'Expedia Inc.' 'Whirlpool Corp.'] In cluster 4, the following companies are present: ['Apache Corporation' 'Chesapeake Energy'] In cluster 8, the following companies are present: ['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'The Mosaic Company' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Spectra Energy Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] In cluster 10, the following companies are present: ['Bank of America Corp' 'Intel Corp.'] In cluster 5, the following companies are present: ['Citigroup Inc.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] In cluster 6, the following companies are present: ['Facebook' 'Frontier Communications' 'Monster Beverage' 'Waters Corporation' 'Yahoo Inc.'] In cluster 9, the following companies are present: ['Priceline.com Inc']
stock1.groupby(['KM_segments', 'GICS Sector'])['Security'].count()
KM_segments GICS Sector
0 Consumer Discretionary 4
Consumer Staples 5
Energy 2
Financials 9
Health Care 6
Industrials 10
Information Technology 3
Materials 1
Real Estate 1
Telecommunications Services 1
1 Consumer Discretionary 3
Consumer Staples 1
Energy 1
Financials 2
Health Care 9
Industrials 2
Information Technology 12
Materials 3
2 Consumer Discretionary 25
Consumer Staples 9
Energy 3
Financials 31
Health Care 19
Industrials 39
Information Technology 11
Materials 14
Real Estate 25
Telecommunications Services 1
Utilities 24
3 Consumer Discretionary 1
Health Care 2
Information Technology 1
Real Estate 1
4 Energy 2
5 Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Telecommunications Services 2
6 Consumer Staples 1
Health Care 1
Information Technology 2
Telecommunications Services 1
7 Consumer Discretionary 4
Financials 2
Health Care 1
8 Energy 21
Industrials 1
Information Technology 2
Materials 2
9 Consumer Discretionary 1
10 Financials 1
Information Technology 1
11 Consumer Discretionary 1
Consumer Staples 2
Financials 1
Industrials 1
12 Consumer Discretionary 1
Health Care 1
Information Technology 1
Name: Security, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle('Boxplot of numerical variables for each cluster')
# selecting numerical columns
num_col = stock.select_dtypes(include=np.number).columns.tolist()
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data = stock1, x = 'KM_segments', y=variable)
plt.tight_layout(pad=2.0)
stock1.groupby(['KM_segments']).mean().plot.bar(figsize=(15,6))
<Axes: xlabel='KM_segments'>
hc_stock = subset_scaled_stock.copy()
# list of distance metrics
distance_metrics = ['euclidean', 'chebyshev', 'mahalanobis', 'cityblock']
# list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'weighted']
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hc_stock, metric = dm, method = lm)
c, coph_dists = cophenet(Z, pdist(hc_stock))
print(
'Cophenetic correlation for {} distance and {} linkage is {}.'.format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
'Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.'.format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281. **************************************************************************************************** Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
# list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted']
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hc_stock, metric = 'euclidean', method=lm)
c, coph_dists = cophenet(Z, pdist(hc_stock))
print('Cophenetic correlation for {} linkage is {}.'.format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = 'euclidean'
high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
'Highest cophenetic correlation is {}, which is obtained with {} linkage.'.format(
high_cophenet_corr, high_dm_lm[1]
)
)
Cophenetic correlation for single linkage is 0.9232271494002922. Cophenetic correlation for complete linkage is 0.7873280186580672. Cophenetic correlation for average linkage is 0.9422540609560814. Cophenetic correlation for centroid linkage is 0.9314012446828154. Cophenetic correlation for ward linkage is 0.7101180299865353. Cophenetic correlation for weighted linkage is 0.8693784298129404. **************************************************************************************************** Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
# list of linkage methods
linkage_methods = ['single', 'complete', 'average', 'centroid', 'ward', 'weighted']
# lists to save results of cophenetic correlation calculation
compare_cols = ['Linkage', 'Cophenetic Coefficient']
compare = []
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(hc_stock, metric = 'euclidean', method = method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f'Dendrogram ({method.capitalize()} Linkage)')
coph_corr, coph_dist = cophenet(Z, pdist(hc_stock))
axs[i].annotate(
f'Cophenetic\nCorrelation\n{coph_corr:0.2f}',
(0.80, 0.80),
xycoords = 'axes fraction',
)
compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by = 'Cophenetic Coefficient')
df_cc
| Linkage | Cophenetic Coefficient | |
|---|---|---|
| 4 | ward | 0.710118 |
| 1 | complete | 0.787328 |
| 5 | weighted | 0.869378 |
| 0 | single | 0.923227 |
| 3 | centroid | 0.931401 |
| 2 | average | 0.942254 |
HCmodel = AgglomerativeClustering(n_clusters = 17, affinity = 'euclidean', linkage = 'ward')
HCmodel.fit(hc_stock)
AgglomerativeClustering(affinity='euclidean', n_clusters=17)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(affinity='euclidean', n_clusters=17)
# creating a copy of the original data
stock2 = stock.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_stock['HC_segments'] = HCmodel.labels_
stock2['HC_segments'] = HCmodel.labels_
hc_cluster_profile = stock2.groupby('HC_segments').mean()
hc_cluster_profile['count_in_each_segment'] = (
stock2.groupby('HC_segments')['Security'].count().values
)
hc_cluster_profile.style.highlight_max(color = 'yellow', axis=0)
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_segments | ||||||||||||
| 0 | 24.485001 | -13.351992 | 3.482611 | 802.000000 | 51.000000 | -1292500000.000000 | -19106500000.000000 | -41.815000 | 519573983.250000 | 60.748608 | 1.565141 | 2 |
| 1 | 47.688572 | -7.815632 | 2.388123 | 55.928571 | 59.857143 | -616127357.142857 | -3551885428.571429 | -10.815000 | 364233130.238571 | 87.406407 | 2.650453 | 14 |
| 2 | 51.839641 | 13.061837 | 1.772744 | 20.016949 | 91.711864 | -292614983.050847 | 1332173118.644068 | 1.922203 | 718503003.122203 | 28.315828 | 1.618133 | 59 |
| 3 | 81.525556 | 11.783328 | 1.575565 | 14.166667 | 544.666667 | 349565833.333333 | 1178117333.333333 | 2.490000 | 1121045865.050000 | 31.693323 | 11.835447 | 6 |
| 4 | 62.811401 | 2.555406 | 1.267919 | 26.504000 | 51.496000 | -68401000.000000 | 1296501952.000000 | 3.608960 | 355627111.979680 | 19.946013 | -1.571196 | 125 |
| 5 | 46.672222 | 5.166566 | 1.079367 | 25.000000 | 58.333333 | -3040666666.666667 | 14848444444.444445 | 3.435556 | 4564959946.222222 | 15.596051 | -6.354193 | 9 |
| 6 | 78.074210 | 4.770519 | 1.177668 | 58.157895 | 73.210526 | 1826347368.421053 | 6359636842.105263 | 7.023421 | 992223453.415790 | 14.465865 | -4.748483 | 19 |
| 7 | 118.748113 | 11.286446 | 1.243758 | 23.754717 | 38.547170 | 38667622.641509 | 999939528.301887 | 4.486981 | 277129584.491509 | 36.375839 | -6.052011 | 53 |
| 8 | 108.304002 | 10.737770 | 1.165694 | 566.200000 | 26.600000 | -278760000.000000 | 687180000.000000 | 1.548000 | 349607057.720000 | 34.898915 | -16.851358 | 5 |
| 9 | 1274.949951 | 3.190527 | 1.268340 | 29.000000 | 184.000000 | -1671386000.000000 | 2551360000.000000 | 50.090000 | 50935516.070000 | 25.453183 | -1.052429 | 1 |
| 10 | 101.838570 | 0.089939 | 1.772285 | 15.142857 | 46.428571 | 52442714.285714 | 1134066428.571429 | 4.841429 | 155080956.911429 | 19.023144 | -51.728525 | 7 |
| 11 | 327.006671 | 21.917380 | 2.029752 | 4.000000 | 106.000000 | 698240666.666667 | 287547000.000000 | 0.750000 | 366763235.300000 | 400.989188 | -5.322376 | 3 |
| 12 | 60.764546 | -13.351487 | 1.951575 | 15.136364 | 72.954545 | 521977590.909091 | 730457454.545455 | 3.277273 | 262445964.186818 | 24.824657 | 5.434401 | 22 |
| 13 | 25.640000 | 11.237908 | 1.322355 | 12.500000 | 130.500000 | 16755500000.000000 | 13654000000.000000 | 3.295000 | 2791829362.100000 | 13.649696 | 1.508484 | 2 |
| 14 | 16.756250 | -30.524515 | 3.611218 | 60.250000 | 11.875000 | -222070875.000000 | -2477074875.000000 | -3.087500 | 777103489.137500 | 82.316914 | -2.336984 | 8 |
| 15 | 276.570007 | 6.189286 | 1.116976 | 30.000000 | 25.000000 | 90885000.000000 | 596541000.000000 | 8.910000 | 66951851.850000 | 31.040405 | 129.064585 | 1 |
| 16 | 467.819992 | 3.154179 | 1.677609 | 15.000000 | 211.750000 | 430286000.000000 | 472058000.000000 | 10.147500 | 57262913.850000 | 61.702232 | 26.018639 | 4 |
for cl in stock2['HC_segments'].unique():
print('In cluster {}, the following companies are present:'.format(cl))
print(stock2[stock2['HC_segments'] == cl]['Security'].unique())
print()
In cluster 6, the following companies are present: ['American Airlines Group' 'BB&T Corporation' 'Discover Financial Services' 'The Walt Disney Company' 'Exelon Corp.' 'Gilead Sciences' 'International Business Machines' "McDonald's Corp." 'Mondelez International' 'MetLife Inc.' 'Altria Group Inc' 'Northern Trust Corp.' 'PepsiCo Inc.' 'Philip Morris International' 'Prudential Financial' 'United Continental Holdings' 'United Health Group Inc.' 'United Parcel Service' 'United Technologies'] In cluster 2, the following companies are present: ['AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'American International Group, Inc.' 'Albemarle Corp' 'Applied Materials Inc' 'Arconic Inc' 'Activision Blizzard' 'Broadcom' 'The Bank of New York Mellon Corp.' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Celgene Corp.' 'Comerica Inc.' 'Centene Corporation' 'Citrix Systems' 'Chevron Corp.' 'Du Pont (E.I.)' 'eBay Inc.' 'E*Trade' 'Edwards Lifesciences' 'Fastenal Co' 'FLIR Systems' 'Flowserve Corporation' 'FMC Corporation' 'First Solar Inc' 'Corning Inc.' 'General Motors' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'HP Inc.' 'Host Hotels & Resorts' 'Invesco Ltd.' 'Jacobs Engineering Group' 'Juniper Networks' 'L-3 Communications Holdings' 'Mastercard Inc.' 'Mattel Inc.' 'Marathon Petroleum' 'Merck & Co.' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'Newmont Mining Corp. (Hldg. Co.)' 'Norfolk Southern Corp.' 'Newell Brands' 'PayPal' 'Charles Schwab Corporation' 'SunTrust Banks' 'TripAdvisor' 'Total System Services' 'Texas Instruments' 'Vulcan Materials' 'Vertex Pharmaceuticals Inc' 'Wynn Resorts Ltd' 'Xerox Corp.' 'Zions Bancorp' 'Zoetis'] In cluster 7, the following companies are present: ['Analog Devices, Inc.' 'American Tower Corp A' 'AvalonBay Communities, Inc.' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'Ball Corp' 'Boston Properties' 'Chubb Limited' 'Crown Castle International Corp.' 'Carnival Corp.' 'Delta Air Lines' 'Quest Diagnostics' 'Delphi Automotive' 'Digital Realty Trust' 'Dr Pepper Snapple Group' 'Ecolab Inc.' 'Equifax Inc.' 'Equity Residential' 'Essex Property Trust, Inc.' 'Extra Space Storage' 'Fortune Brands Home & Security' 'Fiserv Inc' 'Federal Realty Investment Trust' 'Grainger (W.W.) Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Intl Flavors & Fragrances' 'Interpublic Group' 'Laboratory Corp. of America Holding' 'Lockheed Martin Corp.' 'Southwest Airlines' 'Mid-America Apartments' 'Mohawk Industries' 'McCormick & Co.' 'Mettler Toledo' 'NextEra Energy' 'Omnicom Group' "O'Reilly Automotive" 'PPG Industries' 'Royal Caribbean Cruises Ltd' 'Roper Industries' 'Republic Services Inc' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Simon Property Group Inc' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Thermo Fisher Scientific' 'Tyson Foods' 'Vornado Realty Trust' 'Dentsply Sirona' 'Zimmer Biomet Holdings'] In cluster 12, the following companies are present: ['Archer-Daniels-Midland Co' 'Akamai Technologies Inc' 'CF Industries Holdings Inc' 'Cummins Inc.' 'The Cooper Companies' 'EQT Corporation' "Expeditors Int'l" 'Halliburton Co.' 'Hasbro Inc.' 'HCA Holdings' 'Harley-Davidson' 'Hewlett Packard Enterprise' 'Kansas City Southern' 'Leucadia National Corp.' 'Martin Marietta Materials' 'The Mosaic Company' 'PACCAR Inc.' 'Quanta Services Inc.' 'Ryder System' 'Skyworks Solutions' 'Under Armour' 'Universal Health Services, Inc.'] In cluster 15, the following companies are present: ['Alliance Data Systems'] In cluster 4, the following companies are present: ['Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Alaska Air Group Inc' 'Allstate Corp' 'AMETEK Inc' 'Ameriprise Financial' 'AutoNation Inc' 'Aon plc' 'Amphenol Corp' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'Caterpillar Inc.' 'CBRE Group' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'CMS Energy' 'CenterPoint Energy' 'Capital One Financial' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'CVS Health' 'Dominion Resources' 'Deere & Co.' 'Danaher Corp.' 'Dun & Bradstreet' 'Dover Corp.' 'Duke Energy' 'DaVita Inc.' 'Consolidated Edison' "Edison Int'l" 'Eastman Chemical' 'Eversource Energy' 'Eaton Corporation' 'Entergy Corp.' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fluor Corp.' 'General Dynamics' 'General Growth Properties Inc.' 'Genuine Parts' 'Huntington Bancshares' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' "Honeywell Int'l Inc." 'The Hershey Company' 'Humana Inc.' 'IDEXX Laboratories' 'International Paper' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'J. B. Hunt Transport Services' 'Kimco Realty' 'Leggett & Platt' 'Lennar Corp.' 'LKQ Corporation' 'Lilly (Eli) & Co.' 'Alliant Energy Corp' 'Level 3 Communications' 'LyondellBasell' 'Macerich' "Marriott Int'l." 'Masco Corp.' "Moody's Corp" 'Mead Johnson' 'Marsh & McLennan' '3M Company' 'M&T Bank Corp.' 'Nielsen Holdings' 'Nucor Corp.' 'Realty Income Corporation' "People's United Financial" 'Pitney-Bowes' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPL Corp.' 'Phillips 66' 'Praxair Inc.' 'Robert Half International' 'SCANA Corp' 'Sealed Air' 'Southern Co.' 'Stericycle Inc' 'Sempra Energy' 'State Street Corp.' 'Synchrony Financial' 'Stryker Corp.' 'Torchmark Corp.' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tesoro Petroleum Co.' 'UDR Inc' 'Unum Group' 'Union Pacific' 'Varian Medical Systems' 'Valero Energy' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Xylem Inc.' 'Yum! Brands Inc'] In cluster 8, the following companies are present: ['Allegion' 'Charter Communications' 'Colgate-Palmolive' 'Kimberly-Clark' 'S&P Global, Inc.'] In cluster 11, the following companies are present: ['Alexion Pharmaceuticals' 'Amazon.com Inc' 'Netflix Inc.'] In cluster 10, the following companies are present: ['Affiliated Managers Group Inc' 'Anthem Inc.' 'CME Group Inc.' 'Discovery Communications-A' 'Discovery Communications-C' 'Expedia Inc.' 'Whirlpool Corp.'] In cluster 3, the following companies are present: ['Amgen Inc' 'Facebook' 'Frontier Communications' 'Monster Beverage' 'Waters Corporation' 'Yahoo Inc.'] In cluster 0, the following companies are present: ['Apache Corporation' 'Chesapeake Energy'] In cluster 1, the following companies are present: ['Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'Hess Corporation' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'Occidental Petroleum' 'Spectra Energy Corp.' 'Teradata Corp.' 'Cimarex Energy'] In cluster 13, the following companies are present: ['Bank of America Corp' 'Intel Corp.'] In cluster 5, the following companies are present: ['Citigroup Inc.' 'Ford Motor' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] In cluster 16, the following companies are present: ['Chipotle Mexican Grill' 'Equinix' 'Intuitive Surgical Inc.' 'Regeneron'] In cluster 14, the following companies are present: ['Cabot Oil & Gas' 'Freeport-McMoran Cp & Gld' 'Kinder Morgan' 'Marathon Oil Corp.' 'ONEOK' 'Range Resources Corp.' 'Southwestern Energy' 'Williams Cos.'] In cluster 9, the following companies are present: ['Priceline.com Inc']
stock2.groupby(['HC_segments', 'GICS Sector'])['Security'].count()
HC_segments GICS Sector
0 Energy 2
1 Energy 13
Information Technology 1
2 Consumer Discretionary 8
Energy 2
Financials 10
Health Care 11
Industrials 6
Information Technology 16
Materials 5
Real Estate 1
3 Consumer Staples 1
Health Care 2
Information Technology 2
Telecommunications Services 1
4 Consumer Discretionary 9
Consumer Staples 5
Energy 3
Financials 25
Health Care 8
Industrials 28
Information Technology 5
Materials 6
Real Estate 12
Telecommunications Services 2
Utilities 22
5 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 3
Health Care 1
Telecommunications Services 2
6 Consumer Discretionary 2
Consumer Staples 4
Financials 5
Health Care 2
Industrials 4
Information Technology 1
Utilities 1
7 Consumer Discretionary 9
Consumer Staples 5
Financials 1
Health Care 9
Industrials 8
Information Technology 2
Materials 5
Real Estate 13
Utilities 1
8 Consumer Discretionary 1
Consumer Staples 2
Financials 1
Industrials 1
9 Consumer Discretionary 1
10 Consumer Discretionary 4
Financials 2
Health Care 1
11 Consumer Discretionary 1
Health Care 1
Information Technology 1
12 Consumer Discretionary 3
Consumer Staples 1
Energy 2
Financials 1
Health Care 3
Industrials 6
Information Technology 3
Materials 3
13 Financials 1
Information Technology 1
14 Energy 7
Materials 1
15 Information Technology 1
16 Consumer Discretionary 1
Health Care 2
Real Estate 1
Name: Security, dtype: int64
plt.figure(figsize=(20, 20))
plt.suptitle('Boxplot of numerical variables for each cluster')
for i, variable in enumerate(num_col):
plt.subplot(3, 4, i + 1)
sns.boxplot(data = stock2, x='HC_segments', y=variable)
plt.tight_layout(pad=2.0)
Observations
K-means clustering
The data represents 13 segments (clusters) produced by applying the K-means clustering algorithm. Each segment has a different set of average values for the variables: current price, price change, volatility, return on equity (ROE), cash ratio, net cash flow, net income, earnings per share, estimated shares outstanding, P/E ratio, P/B ratio, and the count of data points in each segment.
KMCluster 0 contains 26 companies with a relatively high current price, a positive price change, moderate volatility, high ROE, high cash ratio, positive net cash flow, positive net income, and a positive earnings per share.
KMCluster 1 contains 25 companies with a lower current price, a negative price change, higher volatility, and negative earnings per share.
KMCluster 2 cluster contains 18 companies with a high current price, positive price change, moderate volatility, lower ROE, very high cash ratio, and positive net cash flow.
KMCluster 3 contains 7 companies with a very high current price, positive price change, low volatility, lower ROE, high cash ratio, positive net cash flow, and positive net income.
KMCluster 4 contains 81 companies with a moderate current price, positive price change, moderate volatility, moderate ROE, moderate cash ratio, negative net cash flow, and positive net income.
KMCluster 5 contains 10 companies with a relatively high current price, positive price change, moderate volatility, low ROE, moderate cash ratio, negative net cash flow, and positive net income.
Cluster 6 contains 151 companies with a moderate current price, positive price change, moderate volatility, moderate ROE, moderate cash ratio, positive net cash flow, and positive net income.
KMCluster 7 contains 5 companies with a high current price, positive price change, low volatility, very high ROE, moderate cash ratio, negative net cash flow, and positive net income.
KMCluster 8 contains 3 companies with a low current price, negative price change, high volatility, very high ROE, high cash ratio, negative net cash flow, and negative earnings per share.
KMCluster 9 contains 8 companies with a moderate current price, positive price change, low volatility, moderate ROE, high cash ratio, negative net cash flow, and positive net income.
KMCluster 10 contains 2 companies, Bank of America Corp and Intel Corp, with. a low current price, positive price change, moderate volatility, low ROE, very high cash ratio, positive net cash flow, and positive net income.
KMCluster 11 has only one company -Priceline.com Inc., with a very high current price, positive price change, moderate volatility, lower ROE, high cash ratio, negative net cash flow, and positive net income.
KMCluster 12 contains 3 companies with a high current price, positive price change, high volatility, very low ROE, high cash ratio, positive net cash flow, and negative earnings per share.
Hierarchical Clustering
The data represents the stock prices and financial assets of different companies. The data is grouped into 17 clusters based on hierarchical clustering.
HCluster 0 contains 2 companies with low stock prices, high price changes, and negative earnings per share (EPS).
HCluster 1 contains 5 companies with high stock prices, low volatility, and positive EPS. They have moderate cash ratios and net cash flows.
HCluster 2 contains 22 companies with moderate stock prices, low price changes, and moderate cash ratios. They have moderate net cash flows and positive EPS.
HCluster 3 contains 14 companies with moderate stock prices, low price changes, and high return on equity (ROE). They have negative net cash flows and negative EPS.
HCluster 4 contains 9 companies with low stock prices, high price changes, and negative price-to-book (P/B) ratios. They have negative net cash flows and positive EPS.
HCluster 5 contains 49 companies with moderate stock prices, high price changes, and moderate cash ratios. They have negative net cash flows and positive EPS.
HCluster 6 contains 125 companies with moderate stock prices, moderate price changes, and moderate cash ratios. They have positive net cash flows and positive EPS.
HCluster 7 contains 3 companies with high stock prices, high price changes, and low P/E ratios. They have high cash ratios and negative EPS.
HCluster 8 contains 2 companies with low stock prices, high price changes, and positive P/B ratios. They have high cash ratios and positive EPS.
HCluster 9 contains 6 companies with moderate stock prices, high price changes, and high cash ratios. They have positive net cash flows and positive EPS.
HCluster 10 contains 8 companies with low stock prices, high price changes, and negative EPS. They have negative net cash flows and high P/E ratios.
HCluster 11 contains 1 company with a very high stock price and high P/E ratio. They have negative net cash flows and positive EPS.
HCluster 12 contains 5 companies with moderate stock prices, high price changes, and high ROE. They have negative net cash flows and negative P/B ratios.
HCluster 13 contains 7 companies with moderate stock prices, low price changes, and negative EPS. They have positive net cash flows and negative P/B ratios.
HCluster 14 contains 10 companies with moderate stock prices, high price changes, and moderate cash ratios. They have positive net cash flows and positive P/B ratios.
HCluster 15 contains 53 companies with moderate stock prices, high price changes, and moderate cash ratios. They have positive net cash flows and positive EPS.
HCluster 16 contains 19 companies with moderate stock prices, low price changes, and high ROE. They have positive net cash flows and positive P/B ratios.